Group 1 Midterm

1 Introduction

Consumer preferences are central to company decision-making. The way consumers rank bundles of goods and services by the utility they provide helps explain why some companies hold more power than others in a competitive market. This project focuses on streaming platforms, which play an important role in the more digital, work-from-home (WFH) lifestyle that many families now lead. The pandemic accentuated the consumption of this service, and heavy use of different platforms was evident during this period. To understand this competitive market, the project examines the patterns for each streaming platform, the target population, the various ratings for shows, and how these indicators are correlated. The research found that the world’s largest entertainment companies have ventured into streaming entertainment, including Netflix, Hulu, Prime Video, and Disney+.

1.1 Background

Prior work exists in this area. For instance, JustWatch is an international streaming guide that helps over 20 million users per month find something to watch on Netflix, Prime Video, Disney+, and other streaming platforms. This search engine for digital media is available in 60 countries, and its data is based on the interest its users show in streaming services. Some analyses done through this platform examine streaming catalogs, which have shifted continuously over the years. This changing landscape is interesting because it reflects how user preferences shift over time.

Additionally, several sources report the market shares of selected subscription video-on-demand (SVOD) services in the United States. Statista, a provider of market research and analysis services, reported that Netflix’s share of the U.S. SVOD market decreased from 29 percent in 2019 to 20 percent in 2020 as new platforms such as Peacock and HBO Max entered the market. However, Netflix still leads the video streaming world, followed by Amazon Prime with a 16 percent market share (Statista, 2021).

In this sense, a dataset that included all the features to be reviewed was chosen for the project. The dataset comes from Kaggle, where it was downloaded in CSV format. The data consists of 5368 observations and 11 variables (X1, ID, Title, Year, Age, IMDb, Rotten Tomatoes, Netflix, Hulu, Prime Video, Disney+). The variables are of various types, both numerical and categorical.

1.2 Description of the Dataset

To keep the dataset manageable, only 9 of the 11 variables provided were used. Title is the name of the TV show; Year is the year in which the TV show was produced; Age is the target age group, which takes the values 7+, 13+, 16+, 18+, and all; IMDb is the IMDb rating of the TV show on a scale of 1 to 10; Rotten Tomatoes is the percentage of professional critic reviews that are positive for a given film or TV show, on a scale of 1 to 100. The last four variables are the streaming platforms studied in this project: Netflix, Hulu, Prime Video, and Disney+. These categorical variables take the value 1 if the show is available on the platform and 0 otherwise.

The dataset consists of 5368 observations and contains 3089 missing values, which correspond to the variables Age (2127) and IMDb (962).

1.3 SMART Questions

  1. What are the most targeted age groups for the TV shows by Netflix, Hulu, Prime Video, and Disney+?

  2. Which year published the highest number of TV shows?

  3. Which streaming platform has the highest average rating (according to Rotten Tomatoes and IMDb)?

  4. What is the relationship between IMDb and Rotten Tomatoes?

2 Understanding the Data

2.1 Dataset Summary

The dataset was imported with NA assigned to all blank cells.

## 'data.frame':    5368 obs. of  9 variables:
##  $ Title          : chr  "Breaking Bad" "Stranger Things" "Attack on Titan" "Better Call Saul" ...
##  $ Year           : int  2008 2016 2013 2015 2017 2005 2013 2010 2011 2020 ...
##  $ Age            : chr  "18+" "16+" "18+" "18+" ...
##  $ IMDb           : num  9.4 8.7 9 8.8 8.8 9.3 8.8 8.2 8.8 8.6 ...
##  $ Rotten.Tomatoes: num  100 96 95 94 93 93 93 93 92 92 ...
##  $ Netflix        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Hulu           : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ Prime.Video    : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ Disney.        : int  0 0 0 0 0 0 0 0 0 0 ...

2.2 Cleaning the Dataset

  • Dropped the variables X1 and ID

  • Assigned NA to all blank cells (specifically for Age)

  • Replaced substrings in the variables IMDb and Rotten Tomatoes with the gsub(pattern, replacement, x) function

  • Converted the streaming platform variables to factors with as.factor()

  • Counted the missing values for each variable
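
The cleaning steps above can be sketched in R as follows. This is a minimal illustration rather than the project's actual script: the three rows of data are made up, and the "/10" and "/100" suffixes removed with gsub() are an assumption about how the raw ratings were stored.

```r
# Toy CSV standing in for the Kaggle file (hypothetical rows; the "/10" and
# "/100" rating suffixes are an assumption about the raw format).
raw_csv <- 'X1,ID,Title,Year,Age,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+
0,1,Show A,2008,18+,9.4/10,100/100,1,0,0,0
1,2,Show B,2016,,8.7/10,96/100,1,0,0,0
2,3,Show C,2013,7+,,95/100,0,1,0,0'

# Read the data, turning blank cells into NA.
tv <- read.csv(text = raw_csv, na.strings = c("", "NA"), stringsAsFactors = FALSE)

# Drop the index and ID columns.
tv <- tv[, !(names(tv) %in% c("X1", "ID"))]

# Strip the assumed rating suffixes and convert to numeric.
tv$IMDb <- as.numeric(gsub("/10$", "", tv$IMDb))
tv$Rotten.Tomatoes <- as.numeric(gsub("/100$", "", tv$Rotten.Tomatoes))

# Turn the platform indicators into factors.
platforms <- c("Netflix", "Hulu", "Prime.Video", "Disney.")
tv[platforms] <- lapply(tv[platforms], as.factor)

# Count missing values per variable.
missing_counts <- colSums(is.na(tv))
print(missing_counts)
```

Note that read.csv() rewrites headers into syntactic names by default, which is why Rotten Tomatoes and Disney+ appear as Rotten.Tomatoes and Disney. in the summary.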

2.3 A Closer Look at the Age and Streaming Platform Variables

2.3.1 Age:

## 
## 13+ 16+ 18+  7+ all 
##   9 995 854 831 552

2.3.2 Netflix:

## 
##    0    1 
## 3397 1971

2.3.3 Hulu:

## 
##    0    1 
## 3747 1621

2.3.4 Prime Video:

## 
##    0    1 
## 3537 1831

2.3.5 Disney+:

## 
##    0    1 
## 5017  351

3 Exploratory Data Analysis

3.1 SMART Question: What are the most targeted age groups for the TV shows by Netflix, Hulu, Prime Video, and Disney+?

Across all streaming platforms, shows rated 16+ form the most targeted age group.

3.2 SMART Question: Which year published the highest number of TV shows?

The highest number of TV shows was published in 2017 (685), followed by 2018 (562). The histogram is left-skewed, with a long tail toward earlier years, which indicates that the number of TV shows published has risen over time.

3.3 Rating Systems

3.3.1 Frequency of Ratings

First, we wanted to use exploratory data analysis (EDA) to learn more about the two rating systems: IMDb and Rotten Tomatoes.

         IMDb rating   Rotten Tomatoes rating
Mean     7.086         47.22
Median   7.3           48
Mode     7.4           10

IMDb rates shows on a scale from 1 to 10; in this dataset the lowest rating is 1.1 and the highest is 9.6. Rotten Tomatoes rates TV shows on a scale from 1 to 100; here the lowest score is 10 and the highest is 100. As seen in the histogram, the distribution of IMDb ratings has a slight left skew, which is also reflected in the median (blue) being larger than the mean (red). The Rotten Tomatoes mean and median are almost equal, but there are some outliers at the lower end of the scale. The mode is 10/100, with 304 shows receiving that rating. The Rotten Tomatoes rating combines critics’ ratings and audience ratings, but this dataset only provides the total rating, which is a limitation. It would have been interesting to see how critics and audiences agree or disagree on certain shows. Comparing the two distributions, it became clear that IMDb gives higher ratings on average than Rotten Tomatoes: IMDb has a mean of 7.09/10 (70.9%) and a median of 7.3/10 (73%), while Rotten Tomatoes has a mean of 47.2/100 (47.2%) and a median of 48/100 (48%). This discrepancy was surprising, since we expected the rating systems to generally agree.
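
As a rough illustration of how the three summary statistics can be obtained, the sketch below uses a short made-up vector of ratings. Base R has no built-in function for the statistical mode, so the helper stat_mode() is a hypothetical name introduced only for this example.

```r
# Hypothetical ratings, not taken from the dataset.
ratings <- c(7.4, 6.1, 7.4, 8.0, 5.2, 7.3, NA)

# Most frequent non-missing value (first one wins in case of a tie).
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

summary_stats <- c(
  mean   = mean(ratings, na.rm = TRUE),
  median = median(ratings, na.rm = TRUE),
  mode   = stat_mode(ratings)
)
print(round(summary_stats, 2))
```

In this toy vector the median lies above the mean, the same pattern that signals the slight left skew described for the IMDb ratings.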

3.3.2 Comparison of Rating Systems

In order to further explore this unexpected discrepancy, we created a scatter plot comparing IMDb and Rotten Tomatoes. We also added the dimension of Age, which is the intended age group of each show. This allows us to visualize how shows for different age groups are rated.

Age group   Mean IMDb rating   Mean Rotten Tomatoes rating
7+          7.013              55.031
13+         6.833              54.222
16+         7.248              60.307
18+         7.297              62.667
all         6.853              47.661
NA          6.959              31.701

This scatter plot further confirmed the fact that IMDb had higher ratings when compared to Rotten Tomatoes. There is a positive correlation between the two rating systems, but Rotten Tomatoes consistently has a lower overall rating. This helped us to form a new SMART question: what is the relationship between IMDb and Rotten Tomatoes?

This plot also gave us some information about how shows for different age groups are rated. Shows intended for 18+ had the highest overall rating, with an average of 7.3/10 on IMDb and 62.7/100 on Rotten Tomatoes. The 13+ age group had the lowest average IMDb score at 6.83/10, and shows intended for all ages had the lowest Rotten Tomatoes score at 47.7/100. It should be noted that shows with no intended age group listed (NA) had an even lower Rotten Tomatoes score of 31.7/100, but since that group reflects missing age data, it was excluded from subsequent analysis.
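
Group means of this kind can be computed with aggregate(). The sketch below is only an illustration on made-up rows; the data frame name shows and the label "not listed" for missing age groups are assumptions introduced for the example.

```r
# Hypothetical rows, not taken from the dataset.
shows <- data.frame(
  Age = c("7+", "16+", "18+", "16+", "18+", "all", NA),
  IMDb = c(7.0, 7.2, 7.4, 7.3, 7.2, 6.8, 6.9),
  Rotten.Tomatoes = c(55, 60, 64, 61, 62, 48, 30),
  stringsAsFactors = FALSE
)

# Give missing age groups an explicit label so they form their own group
# instead of being dropped by aggregate().
shows$Age[is.na(shows$Age)] <- "not listed"

# Mean of both ratings within each age group.
group_means <- aggregate(cbind(IMDb, Rotten.Tomatoes) ~ Age,
                         data = shows, FUN = mean)
print(group_means)
```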

3.4 Streaming Platforms

3.4.1 Ratings for Platforms

After comparing the rating systems to each other and then seeing how the intended age group affects ratings, we then turned to the different platforms. We compared the ratings for Disney, Hulu, Netflix, and Prime using boxplots.

Platform   Mean IMDb rating   Mean Rotten Tomatoes rating
Disney     6.971              49.425
Hulu       7.082              52.838
Netflix    7.111              53.559
Prime      7.153              37.761

Much like how the rating systems had different overall distributions, they also painted different pictures for the average rating of shows on each platform. Using IMDb, Prime has the highest average rating of 7.15/10 and Disney has the lowest with 6.97/10. Looking at the Rotten Tomatoes ratings, however, Netflix has the highest average rating of 53.6/100, Hulu has the highest median score of 55/100, and Prime has the lowest average rating of 37.8/100. Prime is particularly interesting in this respect since it has the highest average rating with IMDb, but the lowest average rating with Rotten Tomatoes. This demonstrates that the question of “which platform has the highest average rating?” is not so straightforward.

3.4.2 Age Groups by Platform

Next, we considered the relationship between platform and target age group. The frequency of target age groups differed significantly between different streaming platforms, thus making this a relevant feature to consider in later analysis.

Platform   Most common age group   Ratio of the platform's shows
Disney     all                     0.368
Hulu       16+                     0.309
Netflix    18+                     0.245
Prime      7+                      0.116

Starting with Disney, the most frequent targeted age group was all ages, with 36.8% of the Disney TV shows listed falling into that category. For Hulu, 16+ was the most common age group at 30.9%. The most common age group for Netflix was 18+, comprising 24.5% of its TV shows. Prime was more evenly distributed amongst age groups (apart from 13+, which was very low), but 7+ was the most common age group at 11.6%.
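
The shares reported here are counts of shows in one age group divided by all shows on the platform. A minimal sketch of that calculation, using a made-up stand-in for the disneytv subset, might look like this:

```r
# Hypothetical stand-in for one platform's subset, not taken from the dataset.
disneytv <- data.frame(
  Age = c("all", "all", "7+", "all", NA, "7+", "16+", NA),
  stringsAsFactors = FALSE
)

# Share of a single age group: matching rows over all rows for the platform.
ratio_all <- nrow(subset(disneytv, Age == "all")) / nrow(disneytv)

# All age groups at once; useNA = "ifany" keeps shows with a missing age
# group in the denominator, so the shares match the single-group ratio.
age_shares <- prop.table(table(disneytv$Age, useNA = "ifany"))
print(round(age_shares, 3))
```

Keeping the missing values in the denominator matters for a platform with many unlabeled shows: its most common age group can then account for a small share of the catalog.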

After seeing how much each streaming platform differed when it comes to age group, it would have been interesting to explore other demographics for the audience of each platform. Relevant data in this respect could include gender, race, the number of views, and the actual age of the viewer (as opposed to the target age group). Since this dataset did not include these features, this can be considered another limitation of the data. Our main focus was the TV show ratings, but we could have learned more about user preference and built a more detailed model with that additional information.

3.5 Normality Check for IMDb and Rotten Tomatoes

We have found the average values of the IMDb and Rotten Tomatoes ratings. Next, we check whether these two variables are normally distributed on each platform. If a variable is normally distributed, its mean and median will be approximately the same.
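
To illustrate the check, the sketch below runs shapiro.test() on simulated data and reports the mean and median alongside the test result. The vectors and the helper name check_normality() are hypothetical and are not part of the project code.

```r
set.seed(1)
# Simulated values (hypothetical): one roughly normal, one strongly skewed.
symmetric <- rnorm(500, mean = 7, sd = 1)
skewed <- rexp(500, rate = 1)

# Shapiro-Wilk statistic and p-value together with the mean and median.
check_normality <- function(x) {
  test <- shapiro.test(x)
  c(W = unname(test$statistic), p.value = test$p.value,
    mean = mean(x), median = median(x))
}

print(round(check_normality(symmetric), 3))
print(round(check_normality(skewed), 3))
```

A p-value below 0.05 leads us to reject normality. Note that shapiro.test() only accepts between 3 and 5000 non-missing values, which the per-platform subsets satisfy.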

3.5.1 Normality Check for the Variables IMDb and Rotten Tomatoes for Netflix

## 
##  Shapiro-Wilk normality test
## 
## data:  netflixtv$IMDb
## W = 0.9, p-value <0.0000000000000002
## 
##  Shapiro-Wilk normality test
## 
## data:  netflixtv$Rotten.Tomatoes
## W = 1, p-value = 0.000000009

For Netflix, the p-values for both IMDb and Rotten Tomatoes are less than 0.05. The histogram of IMDb is right-skewed and the histogram of Rotten Tomatoes is slightly left-skewed, so the mean and median are not equal. IMDb and Rotten Tomatoes ratings are therefore not normally distributed for the Netflix platform.

3.5.2 Normality Check for the Variables IMDb and Rotten Tomatoes for Hulu

## 
##  Shapiro-Wilk normality test
## 
## data:  hulutv$IMDb
## W = 0.9, p-value <0.0000000000000002
## 
##  Shapiro-Wilk normality test
## 
## data:  hulutv$Rotten.Tomatoes
## W = 1, p-value = 0.0000000000000006

For Hulu, the p-values for both IMDb and Rotten Tomatoes are less than 0.05. The histogram of IMDb is right-skewed and the histogram of Rotten Tomatoes is slightly bimodal, so the mean and median are not equal. IMDb and Rotten Tomatoes ratings are therefore not normally distributed for the Hulu platform.

3.5.3 Normality Check for the Variables IMDb and Rotten Tomatoes for Prime Video

## 
##  Shapiro-Wilk normality test
## 
## data:  primetv$IMDb
## W = 0.9, p-value <0.0000000000000002
## 
##  Shapiro-Wilk normality test
## 
## data:  primetv$Rotten.Tomatoes
## W = 0.9, p-value <0.0000000000000002

For Prime Video, the p-values for both IMDb and Rotten Tomatoes are less than 0.05. The histogram of IMDb is right-skewed and the histogram of Rotten Tomatoes shows no clear pattern. IMDb and Rotten Tomatoes ratings are therefore not normally distributed for the Prime Video platform.

3.5.4 Normality Check for the Variables IMDb and Rotten Tomatoes for Disney+

## 
##  Shapiro-Wilk normality test
## 
## data:  disneytv$IMDb
## W = 1, p-value = 0.005
## 
##  Shapiro-Wilk normality test
## 
## data:  disneytv$Rotten.Tomatoes
## W = 1, p-value = 0.00009

For Disney+, the p-values for both IMDb and Rotten Tomatoes are less than 0.05. The histogram of IMDb is slightly right-skewed and the histogram of Rotten Tomatoes is slightly left-skewed, and both are otherwise irregular in shape. This leads us to conclude that IMDb and Rotten Tomatoes ratings are not normally distributed for the Disney+ platform.

3.6 SMART Questions

After an initial examination of our chosen data set, we decided on three SMART questions to focus on:

  1. What are the most targeted age groups for the TV shows by Netflix, Hulu, Prime Video, and Disney+?

  2. Which year published the highest number of TV shows?

  3. Which streaming platform has the highest average rating (according to Rotten Tomatoes and IMDb)?

The first question focuses on the relation between the target age group and the platform. By looking at the distribution of the column Age for each platform, we can obtain knowledge about the intended audience for Netflix, Hulu, Prime, and Disney. The second question focuses on the column Year. We quickly noticed that the range of years listed was larger than expected, spanning from 1904 to 2021. We expected that more recent years would have more listed TV shows, but we wanted to explore that distribution in more detail. Finally, the third question looks at the IMDb and Rotten Tomatoes ratings for each platform. While there were some TV shows that were available on more than one platform, we were interested in seeing how the overall ratings were distributed for each platform.

During our exploratory data analysis, we also came up with a fourth SMART question:

What is the relationship between IMDb and Rotten Tomatoes?

We had not initially considered the idea that the IMDb and Rotten Tomatoes rating systems would differ by much, but more in-depth analysis revealed that there were significant differences between how the two systems rated shows. This created an additional dimension of analysis that we did not initially anticipate.

## 
## Call:
## lm(formula = IMDb ~ Rotten.Tomatoes * Platform, data = plot.data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.513 -0.491  0.085  0.623  3.096 
## 
## Coefficients:
##                                 Estimate Std. Error t value
## (Intercept)                      5.33532    0.23088   23.11
## Rotten.Tomatoes                  0.03118    0.00427    7.29
## PlatformHulu                    -0.79742    0.25511   -3.13
## PlatformNetflix                 -0.32703    0.24834   -1.32
## PlatformPrime                    0.15766    0.25105    0.63
## Rotten.Tomatoes:PlatformHulu     0.01290    0.00465    2.77
## Rotten.Tomatoes:PlatformNetflix  0.00703    0.00457    1.54
## Rotten.Tomatoes:PlatformPrime    0.00189    0.00467    0.40
##                                             Pr(>|t|)    
## (Intercept)                     < 0.0000000000000002 ***
## Rotten.Tomatoes                     0.00000000000035 ***
## PlatformHulu                                  0.0018 ** 
## PlatformNetflix                               0.1879    
## PlatformPrime                                 0.5301    
## Rotten.Tomatoes:PlatformHulu                  0.0055 ** 
## Rotten.Tomatoes:PlatformNetflix               0.1240    
## Rotten.Tomatoes:PlatformPrime                 0.6865    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.974 on 4782 degrees of freedom
##   (984 observations deleted due to missingness)
## Multiple R-squared:  0.241,  Adjusted R-squared:  0.24 
## F-statistic:  217 on 7 and 4782 DF,  p-value: <0.0000000000000002
## 
## Call:
## lm(formula = IMDb ~ Rotten.Tomatoes + Platform, data = plot.data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.514 -0.493  0.097  0.632  2.937 
## 
## Coefficients:
##                  Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)      4.970519   0.075666   65.69 < 0.0000000000000002 ***
## Rotten.Tomatoes  0.038133   0.000991   38.49 < 0.0000000000000002 ***
## PlatformHulu    -0.089512   0.061014   -1.47                 0.14    
## PlatformNetflix  0.041857   0.059483    0.70                 0.48    
## PlatformPrime    0.268076   0.061925    4.33             0.000015 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.976 on 4785 degrees of freedom
##   (984 observations deleted due to missingness)
## Multiple R-squared:  0.238,  Adjusted R-squared:  0.237 
## F-statistic:  373 on 4 and 4785 DF,  p-value: <0.0000000000000002
## 
## Call:
## lm(formula = IMDb ~ Rotten.Tomatoes, data = plot.data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.549 -0.487  0.088  0.623  2.760 
## 
## Coefficients:
##                 Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)      5.11965    0.05524    92.7 <0.0000000000000002 ***
## Rotten.Tomatoes  0.03642    0.00098    37.2 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.984 on 4788 degrees of freedom
##   (984 observations deleted due to missingness)
## Multiple R-squared:  0.224,  Adjusted R-squared:  0.224 
## F-statistic: 1.38e+03 on 1 and 4788 DF,  p-value: <0.0000000000000002
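
The three model summaries above are nested: each one adds terms to the previous one. One way to compare such models is sketched below on simulated data. The simulated plot.data only mimics the column names in the model calls; how the real plot.data was assembled is not shown in this report, so its structure here is an assumption.

```r
set.seed(42)
# Simulated data (hypothetical), using the column names from the model calls.
n <- 400
plot.data <- data.frame(
  Rotten.Tomatoes = runif(n, 10, 100),
  Platform = factor(sample(c("Disney", "Hulu", "Netflix", "Prime"),
                           n, replace = TRUE))
)
plot.data$IMDb <- 5 + 0.035 * plot.data$Rotten.Tomatoes + rnorm(n, sd = 1)

# Three nested models: rating only, plus platform, plus interaction.
fit_simple      <- lm(IMDb ~ Rotten.Tomatoes, data = plot.data)
fit_additive    <- lm(IMDb ~ Rotten.Tomatoes + Platform, data = plot.data)
fit_interaction <- lm(IMDb ~ Rotten.Tomatoes * Platform, data = plot.data)

# F-tests for whether each added set of terms improves the fit.
comparison <- anova(fit_simple, fit_additive, fit_interaction)
print(comparison)

# Adjusted R-squared for each model.
adj_r2 <- sapply(
  list(simple = fit_simple, additive = fit_additive,
       interaction = fit_interaction),
  function(m) summary(m)$adj.r.squared
)
print(round(adj_r2, 3))
```

Because Disney is the first factor level alphabetically, it serves as the reference platform, which matches the coefficient names in the summaries above.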

5 Distributions and tests

5.1 Smart Question: Which streaming platform has the highest average rating (according to Rotten Tomatoes and IMDb)?

Our dataset covers four streaming platforms: Netflix, Hulu, Prime Video, and Disney+. To find the platform with the highest average rating on Rotten Tomatoes and on IMDb, we run a t-test for each platform. A t-test is an inferential statistic used to compare the means of two groups; here each test compares the IMDb and Rotten Tomatoes scores of one platform, and its sample estimates give the mean of each rating. We then compare these means across platforms to find the highest average rating.

=======

4 Distributions and Tests

4.1 Smart Question: Which streaming platform has the highest average rating (according to Rotten Tomatoes and IMDb)?

Our dataset covers four streaming platforms: Netflix, Hulu, Prime Video, and Disney+. To find the platform with the highest average rating on Rotten Tomatoes and on IMDb, we run a t-test for each platform. A t-test is an inferential statistic used to compare the means of two groups; here each test compares the IMDb and Rotten Tomatoes scores of one platform, and its sample estimates give the mean of each rating. We then compare these means across platforms to find the highest average rating.

>>>>>>> 9313cefb098722bde3c9474f288dd5836cc101a6
## 
##  Welch Two Sample t-test
## 
## data:  netflixtv$IMDb and netflixtv$Rotten.Tomatoes
## t = -134, df = 1990, p-value <0.0000000000000002
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -47.1 -45.8
## sample estimates:
## mean of x mean of y 
##      7.11     53.56
## 
##  Welch Two Sample t-test
## 
## data:  hulutv$IMDb and hulutv$Rotten.Tomatoes
## t = -98, df = 1635, p-value <0.0000000000000002
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -46.7 -44.8
## sample estimates:
## mean of x mean of y 
##      7.08     52.84
## 
##  Welch Two Sample t-test
## 
## data:  primetv$IMDb and primetv$Rotten.Tomatoes
## t = -62, df = 1846, p-value <0.0000000000000002
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -31.6 -29.6
## sample estimates:
## mean of x mean of y 
##      7.15     37.76
## 
##  Welch Two Sample t-test
## 
## data:  disneytv$IMDb and disneytv$Rotten.Tomatoes
## t = -51, df = 354, p-value <0.0000000000000002
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -44.1 -40.8
## sample estimates:
## mean of x mean of y 
##      6.97     49.42
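
The tests above compare IMDb with Rotten Tomatoes inside one platform, so the platform ranking comes from reading the sample means. As a minimal sketch of a more direct comparison, the code below computes the mean score per platform and runs a Welch t-test between two platforms on the same rating scale. The data frame `shows`, its columns, and the simulated values are assumptions made for illustration only; they are not the project's data.

```r
# Hypothetical toy data: one row per show, with its platform and IMDb score.
set.seed(1)
shows <- data.frame(
  Platform = rep(c("Netflix", "Hulu", "Prime", "Disney+"), each = 50),
  IMDb     = c(rnorm(50, 7.1, 1), rnorm(50, 7.0, 1),
               rnorm(50, 7.2, 1), rnorm(50, 6.9, 1))
)

# Mean IMDb score per platform; the largest entry is the "highest average".
platform_means <- tapply(shows$IMDb, shows$Platform, mean)
best_platform  <- names(which.max(platform_means))

# Welch two-sample t-test between two platforms, both measured on IMDb.
two_platforms   <- subset(shows, Platform %in% c("Prime", "Disney+"))
prime_vs_disney <- t.test(IMDb ~ Platform, data = two_platforms)

print(round(platform_means, 2))
print(best_platform)
print(prime_vs_disney$p.value)
```

Comparing two platforms on one scale avoids mixing the 0 to 10 IMDb scale with the 0 to 100 Rotten Tomatoes scale in a single test.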
<<<<<<< HEAD

Prime Video has the highest average IMDb rating among the four streaming platforms (7.152538), and Netflix has the highest average Rotten Tomatoes rating (53.559107).

5.1.1 SMART Question: Do the rating IMDb and Rotten Tomatoes depend on age variable?

We want to check whether the IMDb and Rotten Tomatoes ratings of shows on each streaming platform are related to the age group of the audience. For each platform, we conduct a chi-square test of independence between age and each of the two ratings.

H0: Age and rating are independent of each other.

H1: Age and rating are not independent of each other.

5.2 Independence check for Netflix platform

=======

Prime Video has the highest average IMDb rating among the four streaming platforms (7.152538), and Netflix has the highest average Rotten Tomatoes rating (53.559107).

4.1.1 SMART Question: Do the rating IMDb and Rotten Tomatoes depend on age variable?

We want to check whether the IMDb and Rotten Tomatoes ratings of shows on each streaming platform are related to the age group of the audience. For each platform, we conduct a chi-square test of independence between age and each of the two ratings. The null and alternative hypotheses are as follows. H0: Age and rating are independent of each other. H1: Age and rating are not independent of each other.

4.2 Independence check for Netflix Platform

>>>>>>> 9313cefb098722bde3c9474f288dd5836cc101a6
##      
##       1.1 2.5 2.7 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9  4 4.1 4.2 4.3 4.4 4.5 4.6
##   16+   1   0   0   0   0   1   2   0   0   0   0  0   0   0   0   1   1   2
##   18+   0   1   1   0   0   0   0   1   0   1   0  0   1   0   0   0   0   4
##   7+    0   0   0   0   0   0   0   1   1   1   1  0   1   2   2   0   1   0
##   all   0   0   0   1   2   0   0   0   1   1   1  1   0   1   0   1   1   1
##      
##       4.7 4.8 4.9  5 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9  6 6.1 6.2 6.3 6.4 6.5
##   16+   0   1   0  0   0   1   1   4   1   0   0   2   1  4   3   8   3   5   6
##   18+   1   1   1  5   1   0   3   5   3   6   4   5   8  5   6  15   7  17  16
##   7+    0   1   3  1   0   3   1   1   6   1   7   4   4  8   3   4   5   5  10
##   all   0   1   5  2   1   2   2   4   1   2   4   1   5  4   1   8   3   6   5
##      
##       6.6 6.7 6.8 6.9  7 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9  8 8.1 8.2 8.3 8.4
##   16+   8  13  12  11  7  17  21  13  29  21  14  24  17  13 18  19  11  16  13
##   18+  14  13  21  16 13  17  23  17  15  23  22  16  19  27 19   9  14  14  15
##   7+    9  11  10  10 15  10  12  11  13  15  12   5   8  10 12  12   7   9  10
##   all   8   7  10   7  2   6   7   4   8   5   3   5   4   2  4   4   2   2   2
##      
##       8.5 8.6 8.7 8.8 8.9  9 9.1 9.3 9.4
##   16+  10   6   8   5   0  4   2   0   0
##   18+   9  11   6   6   1  1   1   0   1
##   7+    6   4   2   4   0  0   1   1   0
##   all   3   3   2   0   0  0   0   1   0
## 
##  Pearson's Chi-squared test
## 
## data:  contable1
## X-squared = 259, df = 192, p-value = 0.0009
##      
##       16 21 22 23 27 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
##   16+  0  0  1  1  0  0  1  2  0  0  2  2  0  3  1  2  3  3  4  3  7  7  3  6
##   18+  0  0  1  0  0  0  1  1  0  0  1  2  2  0  2  4 11  1  6  6  9  7  6 11
##   7+   1  0  0  0  1  1  2  0  1  2  1  2  6  4  7  7 10  5  7  9  8 12  6  6
##   all  1  2  0  0  0  1  0  4  1  2  4  5  5  4  4 18 10  5  8  6 12  9  7  5
##      
##       48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
##   16+  9  4  8  7 12 10 14 17  8 11  6 12 13  6 17  9  7 15 10  9 10  7  8  8
##   18+  9  9 13  7 16 16  9  7 19 12 11  9 19 22  8 15 13 14 10 17  8  7 11 10
##   7+   7  8 10 12 12  8  8 12  6  8  4  5  7  8  4  9  7  7  4  7  3  8  4  3
##   all  6  5  5  5  6  8  2  2  2  4  1  2  0  1  3  3  1  1  0  1  0  0  0  1
##      
##       72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 92 93 94 95 96
##   16+  4 10  8  5  4  9  5  3  8  6  3  4  6  1  7  3  5  2  1  0  1  0  0  1
##   18+ 14  9  9  3  6  9  4  8  4  4  7  6  4  7  7  2  2  3  5  2  2  1  1  0
##   7+   3  3  3  1  1  5  1  2  2  1  2  2  0  1  1  0  0  0  1  0  1  0  0  0
##   all  0  0  0  2  0  0  0  0  1  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0
##      
##       100
##   16+   0
##   18+   1
##   7+    0
##   all   0
## 
##  Pearson's Chi-squared test
## 
## data:  contable2
## X-squared = 448, df = 216, p-value <0.0000000000000002
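
The contingency tables above use every distinct rating value as its own column, so many cells hold 0 or 1 shows; with counts that sparse, `chisq.test()` usually warns that the chi-squared approximation may be incorrect. The sketch below shows one possible variant that bins the ratings into a few bands before testing. The data frame `shows`, the band boundaries, and the simulated values are illustrative assumptions, not the project's actual code or data.

```r
# Hypothetical toy data: target age group and IMDb score for 400 shows.
set.seed(2)
shows <- data.frame(
  Age  = sample(c("7+", "16+", "18+", "all"), 400, replace = TRUE),
  IMDb = round(runif(400, 4, 9.5), 1)
)

# Bin the scores into four bands so that each cell has a reasonable count.
shows$IMDb_band <- cut(shows$IMDb,
                       breaks = c(0, 6, 7, 8, 10),
                       labels = c("<=6", "6-7", "7-8", ">8"))

# Contingency table of age group against rating band.
contable <- table(shows$Age, shows$IMDb_band)

# Chi-square test of independence; df = (rows - 1) * (columns - 1).
test <- chisq.test(contable)
print(contable)
print(test)
```

With four age groups and four bands the test has 9 degrees of freedom, compared with roughly 200 in the tables above.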
<<<<<<< HEAD

Since the p-value for IMDb is 0.0009 and the p-value for Rotten Tomatoes is below 0.0000000000000002, both lower than 0.05, we reject the null hypothesis. Thus, for Netflix, neither the IMDb nor the Rotten Tomatoes rating is independent of age: age and rating are associated on the Netflix platform.

5.3 Independence check for Hulu platform

=======

Since the p-value for IMDb is 0.0009 and the p-value for Rotten Tomatoes is below 0.0000000000000002, both lower than 0.05, we reject the null hypothesis. Thus, for Netflix, neither the IMDb nor the Rotten Tomatoes rating is independent of age, meaning age and rating are associated on the Netflix platform.

4.3 Independence Check for Hulu Platform

>>>>>>> 9313cefb098722bde3c9474f288dd5836cc101a6
## 
##  Pearson's Chi-squared test
## 
## data:  contable3
## X-squared = 230, df = 207, p-value = 0.1
## 
##  Pearson's Chi-squared test
## 
## data:  contable4
## X-squared = 277, df = 210, p-value = 0.001
<<<<<<< HEAD

Since the p-value for IMDb is 0.1, which is greater than 0.05, we fail to reject the null hypothesis. Thus, there is no evidence that age and IMDb rating are associated for Hulu. For Rotten Tomatoes, the p-value is 0.001, which is lower than 0.05, so we reject the null hypothesis. Thus, the Rotten Tomatoes rating for Hulu is not independent of age: age and Rotten Tomatoes rating are associated on the Hulu platform.

5.4 Independence check for Prime tv platform

=======

Since the p-value for IMDb is 0.1, which is greater than 0.05, we fail to reject the null hypothesis. Thus, there is no evidence that age and IMDb rating are associated for Hulu. Meanwhile, the Rotten Tomatoes p-value of 0.001, which is lower than 0.05, leads us to reject the null hypothesis. Thus, the Rotten Tomatoes rating for Hulu is not independent of age, and we conclude that age and Rotten Tomatoes rating are associated on the Hulu platform.

4.4 Independence Check for Prime TV Platform

>>>>>>> 9313cefb098722bde3c9474f288dd5836cc101a6
## 
##  Pearson's Chi-squared test
## 
## data:  contable5
## X-squared = 186, df = 171, p-value = 0.2
## 
##  Pearson's Chi-squared test
## 
## data:  contable6
## X-squared = 320, df = 201, p-value = 0.0000002
<<<<<<< HEAD

Since the p-value for IMDb is 0.2, which is greater than 0.05, we fail to reject the null hypothesis. Thus, there is no evidence that age and IMDb rating are associated for Prime Video. For Rotten Tomatoes, the p-value is 0.0000002, which is lower than 0.05, so we reject the null hypothesis. Thus, the Rotten Tomatoes rating for Prime Video is not independent of age: age and Rotten Tomatoes rating are associated on the Prime Video platform.

5.5 Independence check for Disney+ platform

=======

Since the p-value for IMDb is 0.2, which is greater than 0.05, we fail to reject the null hypothesis. Thus, there is no evidence that age and IMDb rating are associated for Prime Video. Meanwhile, the Rotten Tomatoes p-value of 0.0000002, which is lower than 0.05, leads us to reject the null hypothesis. Thus, the Rotten Tomatoes rating for Prime Video is not independent of age, and we conclude that age and Rotten Tomatoes rating are associated on the Prime Video platform.

4.5 Independence Check for Disney+ Platform

>>>>>>> 9313cefb098722bde3c9474f288dd5836cc101a6
## 
##  Pearson's Chi-squared test
## 
## data:  contable7
## X-squared = 246, df = 156, p-value = 0.000006
## 
##  Pearson's Chi-squared test
## 
## data:  contable8
## X-squared = 227, df = 180, p-value = 0.01
<<<<<<< HEAD

Since the p-value for IMDb is 0.000006 and the p-value for Rotten Tomatoes is 0.01, both lower than 0.05, we reject the null hypothesis. Thus, for Disney+, neither the IMDb nor the Rotten Tomatoes rating is independent of age: age and rating are associated on the Disney+ platform.


This step cleans the data, converting some string columns to numbers and others to factors.

## 'data.frame':    5368 obs. of  12 variables:
##  $ X              : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ ID             : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Title          : chr  "Breaking Bad" "Stranger Things" "Attack on Titan" "Better Call Saul" ...
##  $ Year           : int  2008 2016 2013 2015 2017 2005 2013 2010 2011 2020 ...
##  $ Age            : Factor w/ 5 levels "13+","16+","18+",..: 3 2 3 3 2 4 3 3 3 3 ...
##  $ IMDb           : num  9.4 8.7 9 8.8 8.8 9.3 8.8 8.2 8.8 8.6 ...
##  $ Rotten.Tomatoes: num  100 96 95 94 93 93 93 93 92 92 ...
##  $ Netflix        : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Hulu           : Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 1 1 ...
##  $ Prime.Video    : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 1 ...
##  $ Disney.        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Type           : int  1 1 1 1 1 1 1 1 1 1 ...
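
As a minimal sketch of the kind of cleaning described above, the code below converts score strings to numbers and categorical columns to factors. The raw formats used here ("9.4/10", "100/100") and the data frame `raw` are assumptions made for illustration; the format of the project's actual CSV is not shown in this report.

```r
# Hypothetical raw rows; the string formats are assumed, not taken from the CSV.
raw <- data.frame(
  Age             = c("18+", "16+", NA),
  IMDb            = c("9.4/10", "8.7/10", NA),
  Rotten.Tomatoes = c("100/100", "96/100", "95/100"),
  Netflix         = c(1, 1, 0),
  stringsAsFactors = FALSE
)

clean <- raw

# Strip everything from the slash onward, then convert the rest to numeric.
clean$IMDb            <- as.numeric(sub("/.*$", "", raw$IMDb))
clean$Rotten.Tomatoes <- as.numeric(sub("/.*$", "", raw$Rotten.Tomatoes))

# Treat the age group and the platform indicator as categorical variables.
clean$Age     <- factor(raw$Age)
clean$Netflix <- factor(raw$Netflix)

str(clean)
```

Missing values stay `NA` through the conversion, so they can be dropped or handled explicitly in a later step.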

6 Multiple Linear Regression Model

6.1 SMART Question: What is the relationship between IMDb and Rotten Tomatoes?

6.1.1 Preparation work

6.1.1.1 Drop NaN data

To analyze the relationship between the scores on the two rating sites, we drop the rows with a missing IMDb or Rotten.Tomatoes score. The columns X, ID, Title, and Type are not useful for the model, so we drop them as well.

6.1.1.2 Make a pairs() plot with all the variables (quantitative and qualitative)

6.1.1.3 Make a corrplot() with only the numerical variables

## corrplot 0.92 loaded

6.1.2 Linear regression model build

6.1.2.1 Simple linear model

Using only the variable Rotten.Tomatoes, we build a linear model with one independent variable to predict IMDb.

=======

Since the p-value for IMDB is 0.000006, and p-value for Rotten Tomatoes is 0.01, which are both lower than 0.05, we need to reject the null hypothesis. Thus, IMDb and Rotten Tomatoes ratings for Diseney+ are not independent, meaning Age and rating are correlated for Disney+ platform.

5 Multiple Linear Regression Model

5.1 SMART Question: What is the relationship between IMDb and Rotten Tomatoes?

5.1.1 Preparation Work

5.1.1.1 Dropping NaN Data and Building Histograms

To analyze the relationship between the scores on the two rating sites, we drop the rows with a missing IMDb or Rotten.Tomatoes score. Additionally, the columns X, ID, Title, and Type are not useful for the model, so we drop them as well. With the preparation work finished, we can look at the data.

5.1.1.2 Plotting a pairs() with all the Variables (Quantitative and Qualitative)

5.1.1.3 Plotting a corrplot() with Only the Numerical Variables

5.1.2 Building the Linear Regression Model

5.1.2.1 Simple Linear Model

By using the variable Rotten.Tomatoes only, we built a linear model with 1 independent variable to predict the IMDb.

>>>>>>> 9313cefb098722bde3c9474f288dd5836cc101a6
## 
## Call:
## lm(formula = IMDb ~ Rotten.Tomatoes, data = tvdata_rank)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.562 -0.495  0.087  0.628  2.746 
## 
## Coefficients:
##                 Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)      5.15419    0.05785    89.1 <0.0000000000000002 ***
## Rotten.Tomatoes  0.03590    0.00104    34.6 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.989 on 4404 degrees of freedom
## Multiple R-squared:  0.213,  Adjusted R-squared:  0.213 
## F-statistic: 1.19e+03 on 1 and 4404 DF,  p-value: <0.0000000000000002
Model (num): IMDb ~ Rotten.Tomatoes
                 Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)        5.1542      0.0579     89.1         0
Rotten.Tomatoes    0.0359      0.0010     34.6         0
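
The value 0.213 in the summary above is the R-squared of the model. In a regression with a single predictor, R-squared equals the square of the Pearson correlation between the two variables, so the two numbers should not be confused. The sketch below illustrates this identity on simulated data; the variable names and the simulated values are assumptions for illustration, not the project's data.

```r
# Hypothetical toy data loosely shaped like the two rating scales.
set.seed(3)
rt   <- runif(200, 0, 100)
imdb <- 5 + 0.036 * rt + rnorm(200, sd = 1)

# Simple linear model with one predictor.
fit       <- lm(imdb ~ rt)
r_squared <- summary(fit)$r.squared

# Pearson correlation between the response and the predictor.
r <- cor(imdb, rt)

# With one predictor, R-squared is the squared correlation.
print(all.equal(r_squared, r^2))

# Applied to the reported R-squared of 0.213: the correlation is about 0.46.
print(sqrt(0.213))
```

Read this way, an R-squared of 0.213 means Rotten.Tomatoes explains about 21% of the variance in IMDb.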
<<<<<<< HEAD

From the results above, we find only a weak relationship between IMDb and Rotten.Tomatoes: the R-squared is 0.213, so Rotten.Tomatoes explains about 21% of the variance in IMDb.

6.1.2.2 A variable is added

Because the relationship is weak, we add the other variables to the model. Below is the result:

=======

From the results above, we found only a weak relationship between IMDb and Rotten.Tomatoes: the R-squared is 0.213, so Rotten.Tomatoes explains about 21% of the variance in IMDb.

5.1.2.2 Adding a Variable

Because the relationship is weak, we added the other variables to the model. Below is the result:

>>>>>>> 9313cefb098722bde3c9474f288dd5836cc101a6
## 
## Call:
## lm(formula = IMDb ~ ., data = tvdata_rank)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.361 -0.465  0.080  0.586  2.579 
## 
## Coefficients:
##                 Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)     13.68924    3.61092    3.79              0.00015 ***
## Year            -0.00449    0.00178   -2.52              0.01194 *  
## Age16+           0.14321    0.30940    0.46              0.64349    
## Age18+           0.07727    0.31010    0.25              0.80323    
## Age7+            0.11428    0.30955    0.37              0.71201    
## Ageall           0.22263    0.31096    0.72              0.47409    
## Rotten.Tomatoes  0.04290    0.00131   32.72 < 0.0000000000000002 ***
## Netflix1        -0.10325    0.05598   -1.84              0.06521 .  
## Hulu1           -0.24195    0.05480   -4.42              0.00001 ***
## Prime.Video1     0.09515    0.05565    1.71              0.08740 .  
## Disney.1        -0.24240    0.07845   -3.09              0.00202 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.923 on 3196 degrees of freedom
##   (1199 observations deleted due to missingness)
## Multiple R-squared:  0.285,  Adjusted R-squared:  0.283 
## F-statistic:  128 on 10 and 3196 DF,  p-value: <0.0000000000000002
<<<<<<< HEAD

The adjusted R-squared increases from 0.213 to 0.283, but some variables are not significant, so we try dropping them.

6.1.2.3 Drop sparse age variables

## 
## 13+ 16+ 18+  7+ all 
##   9 987 852 824 535

For Age 13+ there are only 9 shows. Such a small sample leads to large p-values for the age coefficients, so we drop the Age 13+ observations. We also drop Netflix, Prime.Video, and Disney. from the model. Netflix (p = 0.065) and Prime.Video (p = 0.087) are not significant at the 0.05 level; Disney. (p = 0.002) is significant in the table above but was dropped as well.

6.1.2.4 Linear model with three variables

We then fit the third model as follows.

=======

We can see that the adjusted R-squared increases from 0.213 to 0.283, but some variables are not significant, so we then tried dropping them.

5.1.2.3 Dropping Sparse Age Observations

## 
## 13+ 16+ 18+  7+ all 
##   9 987 852 824 535

For Age 13+ there are only 9 shows. Such a small sample leads to large p-values for the age coefficients, so we dropped the Age 13+ observations. We also dropped Netflix, Prime.Video, and Disney. from the model. Netflix (p = 0.065) and Prime.Video (p = 0.087) are not significant at the 0.05 level; Disney. (p = 0.002) is significant in the table above but was dropped as well.

5.1.2.4 Linear Model with Three Variables

From the previous results, we then fit the third model as follows.

>>>>>>> 9313cefb098722bde3c9474f288dd5836cc101a6
## 
## Call:
## lm(formula = IMDb ~ ., data = tvdata_rank_no13)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.417 -0.467  0.087  0.585  2.756 
## 
## Coefficients:
##                 Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)     18.52238    3.47216    5.33          0.000000102 ***
## Year            -0.00681    0.00172   -3.97          0.000074207 ***
## Age18+          -0.06490    0.04422   -1.47                 0.14    
## Age7+           -0.05724    0.04503   -1.27                 0.20    
## Ageall           0.03145    0.05450    0.58                 0.56    
## Rotten.Tomatoes  0.04195    0.00130   32.30 < 0.0000000000000002 ***
## Hulu1           -0.20024    0.03555   -5.63          0.000000019 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.927 on 3191 degrees of freedom
## Multiple R-squared:  0.277,  Adjusted R-squared:  0.276 
## F-statistic:  204 on 6 and 3191 DF,  p-value: <0.0000000000000002
<<<<<<< HEAD

Year, Rotten.Tomatoes, and Hulu are significant, while the individual Age levels are not. The adjusted R-squared is 0.276, higher than the 0.213 of the simple linear regression.

=======

Year, Rotten.Tomatoes, and Hulu are significant, while the individual Age levels are not. The adjusted R-squared is 0.276, higher than the 0.213 of the simple linear regression.

>>>>>>> 9313cefb098722bde3c9474f288dd5836cc101a6
Model (num): IMDb ~ Year + Age + Rotten.Tomatoes + Hulu
                 Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)       18.5224      3.4722    5.335    0.0000
Year              -0.0068      0.0017   -3.968    0.0001
Age18+            -0.0649      0.0442   -1.468    0.1423
Age7+             -0.0572      0.0450   -1.271    0.2038
Ageall             0.0315      0.0545    0.577    0.5639
Rotten.Tomatoes    0.0420      0.0013   32.300    0.0000
Hulu1             -0.2002      0.0355   -5.633    0.0000

VIFs of the model
Age18+  Age7+  Ageall  Hulu1  Rotten.Tomatoes  Year
  1.17   1.42    1.44   1.54              1.2   1.1
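
All VIFs above are close to 1, which suggests little multicollinearity among the predictors. As a minimal sketch of what a VIF measures, the code below computes it from its textbook definition, 1 / (1 - R²), where R² comes from regressing one predictor on the others. The helper `vif_by_hand`, the data frame `d`, and the simulated values are hypothetical; this is not necessarily the function used to produce the table above.

```r
# Hypothetical toy data with three roughly independent predictors.
set.seed(4)
n <- 300
d <- data.frame(
  year = sample(1990:2020, n, replace = TRUE),
  rt   = runif(n, 0, 100),
  hulu = rbinom(n, 1, 0.3)
)
d$imdb <- 18 - 0.006 * d$year + 0.04 * d$rt - 0.2 * d$hulu + rnorm(n)

# VIF of one predictor: regress it on the other predictors and
# return 1 / (1 - R^2) of that auxiliary regression.
vif_by_hand <- function(predictor, others, data) {
  aux <- lm(reformulate(others, response = predictor), data = data)
  1 / (1 - summary(aux)$r.squared)
}

vifs <- c(
  year = vif_by_hand("year", c("rt", "hulu"), d),
  rt   = vif_by_hand("rt",   c("year", "hulu"), d),
  hulu = vif_by_hand("hulu", c("year", "rt"), d)
)
print(round(vifs, 3))
```

A VIF of 1 means a predictor is uncorrelated with the others; values well above 5 or 10 are the usual rule-of-thumb signs of a problem.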
<<<<<<< HEAD

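Comparing this model with the first simple model raises the error “models were not all fitted to the same size of dataset”, because the two models were fitted to different numbers of rows. We therefore refit the simple linear regression on tvdata_rank_no13 as model4.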

6.1.2.5 Linear regression model rebuild

=======

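Comparing this model with the first simple model raised the error “models were not all fitted to the same size of dataset”, because the two models were fitted to different numbers of rows. We therefore refit the simple linear regression on tvdata_rank_no13 as model4.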

5.1.2.5 Rebuilding Linear Regression Model

>>>>>>> 9313cefb098722bde3c9474f288dd5836cc101a6
## 
## Call:
## lm(formula = IMDb ~ Rotten.Tomatoes, data = tvdata_rank_no13)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.396 -0.459  0.075  0.581  2.811 
## 
## Coefficients:
##                 Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)       4.7863     0.0709    67.5 <0.0000000000000002 ***
## Rotten.Tomatoes   0.0407     0.0012    34.0 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.934 on 3196 degrees of freedom
## Multiple R-squared:  0.266,  Adjusted R-squared:  0.266 
## F-statistic: 1.16e+03 on 1 and 3196 DF,  p-value: <0.0000000000000002
Model (num): IMDb ~ Rotten.Tomatoes
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.7863 0.0709 67.5 0
Rotten.Tomatoes 0.0407 0.0012 34.0 0

The concepts above can be extended naturally to models with interactions between numeric and factor variables.

## 
## Call:
## lm(formula = IMDb ~ . + Rotten.Tomatoes:Age, data = tvdata_rank_no13)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.460 -0.457  0.069  0.579  2.788 
## 
## Coefficients:
##                        Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)            20.68621    3.51095    5.89         0.0000000042 ***
## Year                   -0.00781    0.00174   -4.49         0.0000073150 ***
## Age18+                 -0.45763    0.20041   -2.28               0.0225 *  
## Age7+                  -0.52436    0.20025   -2.62               0.0089 ** 
## Ageall                  0.58862    0.23340    2.52               0.0117 *  
## Rotten.Tomatoes         0.03930    0.00218   18.03 < 0.0000000000000002 ***
## Hulu1                  -0.20345    0.03549   -5.73         0.0000000108 ***
## Age18+:Rotten.Tomatoes  0.00639    0.00317    2.01               0.0440 *  
## Age7+:Rotten.Tomatoes   0.00814    0.00340    2.39               0.0168 *  
## Ageall:Rotten.Tomatoes -0.01237    0.00445   -2.78               0.0055 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.924 on 3188 degrees of freedom
## Multiple R-squared:  0.283,  Adjusted R-squared:  0.281 
## F-statistic:  140 on 9 and 3188 DF,  p-value: <0.0000000000000002
Model (num): IMDb ~ Year + Age + Rotten.Tomatoes + Hulu + Rotten.Tomatoes:Age
                        Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)              20.6862      3.5110     5.89    0.0000
Year                     -0.0078      0.0017    -4.49    0.0000
Age18+                   -0.4576      0.2004    -2.28    0.0225
Age7+                    -0.5244      0.2003    -2.62    0.0089
Ageall                    0.5886      0.2334     2.52    0.0117
Rotten.Tomatoes           0.0393      0.0022    18.03    0.0000
Hulu1                    -0.2034      0.0355    -5.73    0.0000
Age18+:Rotten.Tomatoes    0.0064      0.0032     2.02    0.0440
Age7+:Rotten.Tomatoes     0.0081      0.0034     2.39    0.0168
Ageall:Rotten.Tomatoes   -0.0124      0.0045    -2.78    0.0055
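
In a model with an interaction between a numeric variable and a factor, the slope of the numeric variable differs by factor level: it is the base slope plus the interaction coefficient for that level. The sketch below works this out for the interaction model above, using the coefficients printed in its summary. Age 16+ is the reference level because it has no coefficient of its own in the output.

```r
# Coefficients copied from the interaction model summary above.
base_slope  <- 0.03930   # Rotten.Tomatoes slope for the reference level (16+)
interaction <- c("18+" = 0.00639, "7+" = 0.00814, "all" = -0.01237)

# Slope of IMDb on Rotten.Tomatoes within each age group.
slopes <- c("16+" = base_slope, base_slope + interaction)
print(round(slopes, 5))
```

Under this model, shows rated for all ages have the flattest relationship between the two scores, and shows rated 7+ have the steepest.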
<<<<<<< HEAD

6.1.2.6 Comparison of the three models

=======

5.1.2.6 Comparison of the Three Models

>>>>>>> 9313cefb098722bde3c9474f288dd5836cc101a6
## Analysis of Variance Table
## 
## Model 1: IMDb ~ Rotten.Tomatoes
## Model 2: IMDb ~ Year + Age + Rotten.Tomatoes + Hulu
## Model 3: IMDb ~ Year + Age + Rotten.Tomatoes + Hulu + Rotten.Tomatoes:Age
##   Res.Df  RSS Df Sum of Sq    F       Pr(>F)    
## 1   3196 2787                                   
## 2   3191 2744  5      43.2 10.1 0.0000000013 ***
## 3   3188 2724  3      20.2  7.9 0.0000300757 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ANOVA comparison between the models
Res.Df   RSS  Df  Sum of Sq     F  Pr(>F)
  3196  2787  NA         NA    NA      NA
  3191  2744   5       43.2  10.1       0
  3188  2724   3       20.2   7.9       0
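
An ANOVA comparison of nested models requires that every model is fitted to the same rows. As a minimal sketch of one way to guarantee this, the code below removes incomplete rows once and then fits all three models on the resulting data frame. The data frame `d`, its columns, and the simulated values are assumptions for illustration only.

```r
# Hypothetical toy data with missing values in the age column.
set.seed(5)
n <- 200
d <- data.frame(
  rt  = runif(n, 0, 100),
  age = factor(sample(c("16+", "18+", "7+", NA), n, replace = TRUE))
)
d$imdb <- 5 + 0.04 * d$rt + rnorm(n)

# Keep only complete rows so every model sees the same observations.
d_complete <- na.omit(d)

m1 <- lm(imdb ~ rt,       data = d_complete)  # simple model
m2 <- lm(imdb ~ rt + age, data = d_complete)  # adds the factor
m3 <- lm(imdb ~ rt * age, data = d_complete)  # adds the interaction

# Sequential F-tests: each row compares a model with the one before it.
comparison <- anova(m1, m2, m3)
print(comparison)
```

Fitting `m1` on `d` and `m2` on `d_complete` instead would give the models different numbers of observations, which is the situation that triggers the "not all fitted to the same size of dataset" error.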
<<<<<<< HEAD

With one interaction added, the full model fits reasonably well, although the relationship is still weak (adjusted R-squared of 0.281).

6.2 Final results and approved model for prediction of IMDb rating

IMDb = 23.464467 - 0.009186 * Year + 0.046692 * Rotten.Tomatoes - 0.209429 * Hulu + (-0.516497 + 0.000526 * Rotten.Tomatoes) * Age7+ + (0.005813 - 0.007472 * Rotten.Tomatoes) * Age16+ + (-0.453837 - 0.001010 * Rotten.Tomatoes) * Age18+ + (0.609364 - 0.020359 * Rotten.Tomatoes) * Ageall

6.3 Problem and future researches

6.3.1 Weak correlation

The weak correlation between IMDb and Rotten Tomatoes suggests that other variables, which we have not yet identified, influence the ratings. Finding these variables can be a focus of future research.

6.3.2 Left-skewed IMDb

The IMDb histogram above is slightly left-skewed, which indicates a number of outlying shows with very low IMDb scores. When building the model, we did not exclude these outliers because we considered them important.

7 Conclusion

Over the previous decade, some of the world’s largest entertainment companies have ventured into streaming entertainment, including Netflix, Hulu, Prime Video, and Disney+. As more individuals are compelled to stay at home to prevent the spread of the new coronavirus, the idea of a bored, cable-cutting consumer looking for shows, documentaries, or series to watch for weeks on end has become a reality. As a result, we selected the TV shows found on Netflix, Hulu, Prime Video, and Disney+ as the topic of this project. We analyzed the rate at which TV shows have been released over time, the most popular streaming platform, and the targeted audience. The dataset comes from a Kaggle sample dataset and was extracted in CSV format. It consists of 5367 observations and 11 variables (ID, Title, Year, Age, IMDb, Rotten Tomatoes, Netflix, Hulu, Prime Video) and contains both numerical and categorical data.

SMART QUESTIONS:

1. What are the most targeted age groups for the TV shows by Netflix, Hulu, Prime Video?

2. Which year published the highest number of TV shows?

3. Which streaming platform has the highest average rating (according to Rotten Tomatoes and IMDb)?

4. What is the relationship between IMDb and Rotten Tomatoes?

=======

With one interaction added, the full model fits reasonably well, although the relationship is still weak (adjusted R-squared of 0.281).

5.2 Final Results and Approved Model for Prediction of IMDb Rating

IMDb = 23.464467 - 0.009186 * Year + 0.046692 * Rotten.Tomatoes - 0.209429 * Hulu + (-0.516497 + 0.000526 * Rotten.Tomatoes) * Age7+ + (0.005813 - 0.007472 * Rotten.Tomatoes) * Age16+ + (-0.453837 - 0.001010 * Rotten.Tomatoes) * Age18+ + (0.609364 - 0.020359 * Rotten.Tomatoes) * Ageall

5.3 Problem and Future Research

5.3.1 Weak Correlation

The weak correlation between IMDb and Rotten Tomatoes suggests that other variables, which we have not yet identified, influence the ratings. Finding these variables can be a focus of future research.

5.3.2 Left-Skewed IMDb

The IMDb histogram above is slightly left-skewed, which indicates a number of outlying shows with low scores. When building the model, we did not exclude these outliers because we considered them important.

6 Conclusion

>>>>>>> 9313cefb098722bde3c9474f288dd5836cc101a6

The independent variables we considered include the target age group and the streaming platform. Our dependent variable was the TV show rating, on both IMDb and Rotten Tomatoes. Our analysis provided insights into people’s preferences for TV shows on different platforms.

After data cleaning and EDA on most variables in the dataset, we found that 16+ is the most targeted age group for TV shows, followed by 18+, which means the main target audience is adolescents and young adults.

We also found that the target age group was highly dependent on the streaming platform. Disney TV shows catered to all ages, Netflix and Hulu focused on 18+, and Prime was more varied across multiple groups.

Looking at the years in which TV shows were released, more and more TV shows have been created in recent years, with a peak in 2017. Very few of the listed shows were produced during the 20th century; the majority were created in the past 20 years.

As we proceeded with hypothesis testing across platforms and ratings, we found that IMDb has a higher average rating than Rotten Tomatoes and that there is only a weak correlation between the two rating systems. The correlation between IMDb and Rotten Tomatoes is positive, but the two have different distributions overall: IMDb has a higher average and is left-skewed, while Rotten Tomatoes tends to be lower and has low-value outliers. In addition, IMDb and Rotten Tomatoes disagree about the highest-rated platforms. Using IMDb, Prime has both the highest mean and the highest median rating. According to Rotten Tomatoes, however, Prime has the lowest rating, Hulu has the highest median rating, and Netflix has the highest mean rating.

As we were exploring possible linear models, relating the rating systems to platform, age group, and year of creation, we found a correlation between age and Rotten Tomatoes using a linear model. This relation can be seen in our final results.

<<<<<<< HEAD

8 Bibliography

=======

7 References

Kaggle, “Movies on Netflix, Prime Video, Hulu and Disney+. A collection of movies found on these streaming platforms”, Mar 1 2020, https://www.kaggle.com/ruchi798/movies-on-netflix-prime-video-hulu-and-disney

Statista, 2021, “Market shares of selected Subscription video-on-demand (SVOD) services in the United States in 2020”, Mar 1 2020, https://www.statista.com/statistics/496011/usa-svod-to-tv-streaming-usage/

The Verge, 2020, “In a streaming wars world, JustWatch has become an essential tool”, Mar 1 2020, https://www.theverge.com/2020/6/18/21291519/justwatch-streaming-search-engine-filter-app-impressions-hbo-max-disney-plus-netflix-hulu

>>>>>>> 9313cefb098722bde3c9474f288dd5836cc101a6